Lazy-allocate error latency histogram on AggregateEntry#11478
Conversation
|
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
| } | ||
| } | ||
|
|
||
| private byte[] emptyHistogramBytesCache; |
There was a problem hiding this comment.
this can be final and initialised early? I don't think it will make a big difference. Also the field should be placed up among the other fields and not among methods for clarity
There was a problem hiding this comment.
Addressed — field moved up to the instance fields block. Kept lazy rather than final/eager: Histogram.newHistogram() requires the Histograms factory to be registered, and SerializingMetricWriter is constructed during tracer startup before that registration completes, so eager init would throw. Added an inline comment on the field explaining this. The single-writer invariant (aggregator thread only) means no synchronization is needed either.
(Reply by Claude Sonnet 4.6)
There was a problem hiding this comment.
Another way that we could address the cycle is by placing this field inside a helper class. That would allow us to make the field final and let the reference be constant propagated.
Each AggregateEntry allocated two DDSketchHistograms in its constructor (ok + error latencies). DDSketchHistogram wraps a DDSketch + lazy store, roughly 60-80 bytes per histogram even when empty. Most spans aren't errors, so most entries' errorLatencies sit empty for life. Now the field starts null. recordOneDuration lazy-allocates on the first error; if no error ever lands on the entry, it stays null and ~80 bytes of empty-histogram overhead are reclaimed. Across a full 2048-entry table that's ~150 KB if 95% of entries never error -- the typical case. For the wire format, SerializingMetricWriter caches the serialized form of an empty histogram (~17 bytes) on first use and writes those cached bytes when an entry's errorLatencies is null. The cache is per-writer (not a global static) so each writer instance picks up the Histograms factory state at the time of its first report, avoiding races with test setup that registers the DDSketch factory at varying points. Trade-off: entries that DO see an error retain the histogram across clear() (just cleared, not nulled), so always-erroring entries allocate exactly once. Same total allocation as before for that case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f2ee559 to
0c658dd
Compare
|
Hi! 👋 Thanks for your pull request! 🎉 To help us review it, please make sure to:
If you need help, please check our contributing guidelines. |
…ports - Move emptyHistogramBytesCache up to the instance fields block - Import java.nio.ByteBuffer and datadog.metrics.api.Histogram; drop FQNs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Addressed @amarziali's two comments:
(Comment written by Claude Sonnet 4.6) |
|
I want to collect benchmarking data before merging this. I'm still working on that. |
Summary
errorLatencieshistogram allocation until the first error is recorded on an entry. Most entries never see an error in their lifetime; previously each one carried a ~60-80 byte emptyDDSketchHistogramfor life.SerializingMetricWritercaches the serialized form of an empty histogram (~17 bytes) and emits those cached bytes when an entry'serrorLatenciesisnull, so the wire format is byte-identical to before.Background
Extracted from #11389, where the same change was bundled with cardinality- and peer-tag-related work. This PR is just the lazy-errorLatencies piece; it sits between #11382 and #11387 so it can ship without depending on the cardinality machinery in #11387.
Trade-off
Entries that do see an error retain the histogram across
clear()(cleared, not nulled). An always-erroring entry allocates exactly once. Same total allocation as before for that path.Throughput benchmarks
This is a heap-footprint change, not a CPU one — the consumer's hot path is unchanged. The bench suite was re-run anyway as a sanity check to confirm no throughput regression vs the #11382 base. Same machine state and JMH config as the rest of the stack's runs (8 producer threads, 2×15s warmup + 5×15s, 1 fork, throughput mode).
AdversarialHighCardinalityResourceHighCardinalityPeer#11478 vs #11382is within the per-run error bar on every bench (0.94×–0.97×) — statistically indistinguishable. The CPU-side hot path didn't change:recordOneDurationnow callserrorLatenciesForWrite()instead of reading a final field, but that's a single-field-load-and-branch on every entry's first error and a direct field load thereafter, which the JIT inlines flat.aggregateDroppedcounts are also in line with #11382, confirming the lazy field doesn't perturb the table-cap behavior.The actual win — the ~150 KB heap reclamation at full table cap when 95% of entries never error — isn't observable in a throughput bench. It would show up in
jol-based per-entry footprint inspection (one fewer histogram per entry) or in a long-running profile of allocated-bytes-per-cycle (errorLatencies allocation amortizes from "one per unique key" to "one per unique error-emitting key").Test plan
:dd-trace-core:test— metrics tests pass🤖 Generated with Claude Code